Sains Malaysiana 54(8)(2025): 2087-2097
http://doi.org/10.17576/jsm-2025-5408-17
Improved Robust Principal Component Analysis based on Minimum Regularized Covariance Determinant for the Detection of High Leverage Points in High Dimensional Data
(Penambahbaikan Analisis Komponen Utama berdasarkan Penentu Kovarian Teratur Minimum bagi Mengecam Titik Tuasan Tinggi untuk Data Dimensi Tinggi)
HABSHAH MIDI1,2,*, JAAZ SUHAIZA1,3, MOHD ASLAM1,2, HANI SYAHIDA2 & EMI AMIELDA3
1Institute for Mathematical Research, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
2Department of Mathematics & Statistics, Universiti Putra Malaysia, 43400 UPM Serdang, Selangor, Malaysia
3Faculty of Computing & Multimedia, Universiti Poly-Tech Malaysia, 56100 Cheras, Kuala Lumpur, Malaysia
Received: 22 April 2024/Accepted: 13 March 2025
Abstract
This paper presents an extension of robust principal component analysis (ROBPCA), denoted IRPCA, that improves the accuracy of detecting high leverage points (HLPs) in high dimensional data (HDD). IRPCA first applies principal component analysis (PCA) to reduce the dimension of the data set, and then obtains robust location and scatter estimates of the PC scores based on the Minimum Regularized Covariance Determinant (MRCD). Instead of the robust score distance used in ROBPCA, the proposed IRPCA uses the robust Mahalanobis distance (RMD) to detect HLPs. The performance of IRPCA is compared with that of ROBPCA and of the Minimum Regularized Covariance Determinant and PCA-based method (MRCD-PCA) for identifying HLPs in HDD. The results indicate that all three methods detect HLPs very successfully, with no masking effect. Nonetheless, ROBPCA suffers from serious swamping when HLPs make up less than 30% of the data. The proposed IRPCA and MRCD-PCA perform similarly, with very small swamping effects. However, the MRCD-PCA algorithm is quite cumbersome and requires a longer computational running time. The attractive feature of IRPCA is that its algorithm is simpler and very fast.
Keywords: High leverage point; minimum regularized covariance determinant; principal component analysis; robust Mahalanobis distance
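The three stages named in the abstract (PCA dimension reduction, robust location and scatter of the PC scores, and flagging by robust Mahalanobis distance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The MRCD step is replaced here by a simplified stand-in: concentration steps on an h-subset with a covariance shrunk toward a diagonal target. The function names, the number of components `k`, the subset fraction `h_frac`, the shrinkage weight `rho`, and the median + 3 MAD cutoff are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def pca_scores(X, k):
    """Classical PCA via SVD of the column-centered data; returns n x k scores."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def squared_distances(T, mu, S):
    """Squared Mahalanobis distance of each row of T from mu under scatter S."""
    diff = T - mu
    return np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

def robust_location_scatter(T, h_frac=0.75, rho=0.1, max_iter=50):
    """Simplified stand-in for MRCD (assumption, not the published algorithm):
    concentration steps on an h-subset with a regularized covariance."""
    n, _ = T.shape
    h = int(np.ceil(h_frac * n))
    med = np.median(T, axis=0)
    mad = 1.4826 * np.median(np.abs(T - med), axis=0)
    mad[mad == 0] = 1.0
    target = np.diag(mad ** 2)  # diagonal shrinkage target (assumed)
    # Start from the h points closest to the coordinatewise median.
    idx = np.argsort((((T - med) / mad) ** 2).sum(axis=1))[:h]
    for _ in range(max_iter):
        mu = T[idx].mean(axis=0)
        S = np.atleast_2d(np.cov(T[idx], rowvar=False))
        S_reg = rho * target + (1.0 - rho) * S
        new_idx = np.argsort(squared_distances(T, mu, S_reg))[:h]
        if set(new_idx) == set(idx):  # subset is stable: stop
            break
        idx = new_idx
    return mu, S_reg

def detect_hlps(X, k=2):
    """Flag rows whose robust Mahalanobis distance in PC-score space
    exceeds median + 3 * MAD (cutoff rule is an assumption)."""
    T = pca_scores(X, k)
    mu, S = robust_location_scatter(T)
    rmd = np.sqrt(squared_distances(T, mu, S))
    med = np.median(rmd)
    cutoff = med + 3.0 * 1.4826 * np.median(np.abs(rmd - med))
    return np.flatnonzero(rmd > cutoff), rmd, cutoff

# Synthetic example: n = 60 observations, p = 200 variables, 6 planted HLPs.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
planted = np.arange(6)
X[planted] += 5.0
flagged, rmd, cutoff = detect_hlps(X, k=2)
```

In this synthetic setting the planted rows lie far from the bulk in the score space, so `flagged` should contain them. Clean rows that also appear in `flagged` would correspond to the swamping effect discussed in the abstract.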
Abstrak
Kertas ini membentangkan kerja lanjutan bagi Analisis Komponen Utama Teguh (ROBPCA) ditandakan dengan IRPCA, untuk meningkatkan ketepatan pengecaman titik tuasan tinggi (HLPs) dalam data dimensi tinggi (HDD). IRPCA menggunakan Analisis Komponen Utama (PCA) bagi menurunkan dimensi set data dan seterusnya penganggar lokasi dan skala skor PC dikira berdasarkan Penentu Kovarian Teratur Minimum (MRCD). Dengan tidak menggunakan jarak skor teguh untuk pengecaman HLPs seperti ROBPCA; dalam kaedah IRPCA yang dicadangkan, kami telah mempertimbangkan penggunaan Jarak Mahalanobis Teguh (RMD). Prestasi IRPCA yang dicadang dibandingkan dengan kaedah ROBPCA dan kaedah Penentu Kovarian Teratur Minimum dan PCA (MRCD-PCA) bagi mengecam HLPs dalam HDD. Keputusan menunjukkan ketiga-tiga kaedah sangat berjaya dalam pengesanan HLPs tanpa kesan penyorokan. Walau bagaimanapun, ROBPCA mengalami masalah kesan limpahan yang serius apabila terdapat HLPs kurang daripada 30%. Prestasi IRPCA yang dicadangkan dan MRCD-PCA adalah sama; mempunyai kesan limpahan yang sangat kecil. Namun begitu, algoritma MRCD-PCA agak rumit dan memerlukan masa yang panjang. Sifat menarik bagi IRPCA ialah ia memberi algoritma yang mudah dan masa pengiraan yang singkat.
Kata kunci: Analisis komponen utama; jarak Mahalanobis teguh; penentu kovarian teratur minimum; titik tuasan tinggi
*Corresponding author; email: habshah@upm.edu.my